23 



Conditioning is the issue 



James O. Berger 

Department of Statistical Science 
Duke University, Durham, NC 



The importance of conditioning in statistics and its implementation are high- 
lighted through the series of examples that most strongly affected my under- 
standing of the issue. The examples range from "oldies but goodies" to new 
examples that illustrate the importance of thinking conditionally in modern 
statistical developments. The enormous potential impact of improved handling 
of conditioning is also illustrated. 



23.1 Introduction 

No, this is not about conditioning in the sense of "I was conditioned to be a 
Bayesian." Indeed I was educated at Cornell University in the early 1970s, by 
Jack Kiefer, Jack Wolfowitz, Roger Farrell and my advisor Larry Brown, in 
a strong frequentist tradition, albeit with heavy use of prior distributions as
technical tools. My early work on shrinkage estimation got me thinking more 
about the Bayesian perspective; doesn't one need to decide where to shrink, 
and how can that decision not require Bayesian thinking? But it wasn't until 
I encountered statistical conditioning (see the next section if you do not know 
what that means) and the Likelihood Principle that I suddenly felt like I had 
woken up and was beginning to understand the foundations of statistics. 

Bayesian analysis, because it is completely conditional (depending on the 
statistical model only through the observed data), automatically conditions 
properly and, hence, has been the focus of much of my work. But I never 
stopped being a frequentist and came to understand that frequentists can 
also appropriately condition. Not surprisingly (in retrospect, but not at the 
time) I found that, when frequentists do appropriately condition, they obtain 
answers remarkably like the Bayesian answers; this, in my mind, makes con- 
ditioning the key issue in the foundations of statistics, as it unifies the two 
major perspectives of statistics. 






The practical importance of conditioning arises because, when it is not 
done in certain scenarios, the results can be very detrimental to science. Un- 
fortunately, this is the case for many of the most commonly used statistical 
procedures, as will be discussed. 

This chapter is a brief tour of old and new examples that most influenced 
me over the years concerning the need to appropriately condition. The new 
ones include performing a sequence of tests, as is now common in clinical trials 
and is being done badly, and an example involving a type of false discovery rate. 



23.2 Cox example and a pedagogical example 

As this is more of an account of my own experiences with conditioning, I have 
not tried to track down when the notion first arose. Pierre Simon de Laplace 
likely understood the issue, as he spent much of his career as a Bayesian 
in dealing with applied problems and then, later in life, also developed fre- 
quentist inference. Clearly Ronald Fisher and Harold Jeffreys knew all about 
conditioning early on. My first introduction to conditioning was the example 
of Cox (1958). 

A variant of the Cox example: Every day an employee enters a lab to 
perform assays, and is assigned an unbiased instrument to perform the assays. 
Half of the available instruments are new and have a small variance of 1, while 
the other half are old and have a variance of 3. The employee is assigned each 
type with probability 1/2, and knows whether the instrument is old or new. 

Conditional inference: For each assay, report variance 1 or 3, depending on 
whether a new or an old instrument is being used. 

Unconditional inference: The overall variance of the assays is $.5 \times 1 + .5 \times 3 = 2$,
so report a variance of 2 always. 

It seems silly to do the unconditional inference here, especially when noting 
that the conditional inference is also fully frequentist; in the latter, one is just 
choosing a different subset of events over which to do a long run average.
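
The two reports in this variant of the Cox example can be sketched in a few lines of Python. This is only an illustration: the dictionary `instruments` and the two helper functions are names introduced here, and the calculation relies on the stated assumption that both instrument types are unbiased, so the overall variance is simply the probability-weighted average of the two variances.

```python
from fractions import Fraction

# Instrument types from the example: (probability of assignment, variance).
instruments = {"new": (Fraction(1, 2), 1), "old": (Fraction(1, 2), 3)}

def unconditional_variance(instruments):
    """Average the variance over the random instrument assignment."""
    return sum(prob * var for prob, var in instruments.values())

def conditional_variance(instruments, assigned):
    """Report the variance of the instrument actually used."""
    return instruments[assigned][1]

print(unconditional_variance(instruments))        # 2
print(conditional_variance(instruments, "new"))   # 1
print(conditional_variance(instruments, "old"))   # 3
```

Both reports are long run averages; they differ only in the set of days being averaged over (all days, or only the days on which the given type of instrument was assigned).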

The Cox example contains the essence of conditioning, but tends to be 
dismissed because of the issue of "global frequentism." The completely pure 
frequentist position is that one's entire life is a huge experiment, and so the 
correct frequentist average is over all possibilities in all situations involving 
uncertainty that one encounters in life. As this is clearly impossible, frequen- 
tists have historically chosen to condition on the experiment actually being 
conducted before applying frequentist logic; then Cox's example would seem 
irrelevant. However, the virtually identical issue can arise within an experi- 
ment, as demonstrated in the following example, first appearing in Berger and 
Wolpert (1984). 



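Pedagogical example: Two observations, $X_1$ and $X_2$, are to be taken, where

$$X_i = \begin{cases} \theta + 1 & \text{with probability } 1/2, \\ \theta - 1 & \text{with probability } 1/2. \end{cases}$$

Consider the confidence set for the unknown $\theta \in \mathbb{R}$:

$$C(X_1, X_2) = \begin{cases} \text{the singleton } \{(X_1 + X_2)/2\} & \text{if } X_1 \neq X_2, \\ \text{the singleton } \{X_1 - 1\} & \text{if } X_1 = X_2. \end{cases}$$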

The frequentist coverage of this confidence set is

$$P_\theta\{C(X_1, X_2) \text{ contains } \theta\} = .75,$$

which is not at all a sensible report once the data is at hand. Indeed, if $x_1 \neq x_2$,
then we know for sure that $(x_1 + x_2)/2$ is equal to $\theta$, so that the confidence
set is then actually 100% accurate. On the other hand, if $x_1 = x_2$, we do not
know whether $\theta$ is the data's common value plus 1 or their common value
minus 1, and each of these possibilities is equally likely to have occurred;
the confidence interval is then only 50% accurate. While it is not wrong to
say that the confidence interval has 75% coverage, it is obviously much more
scientifically useful to report 100% or 50%, depending on the data. And again,
this conditional report is still fully frequentist, averaging over the sets of data
$\{(x_1, x_2) : x_1 \neq x_2\}$ and $\{(x_1, x_2) : x_1 = x_2\}$, respectively.
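
The three coverage figures can be checked by enumerating the four equally likely outcomes. The sketch below assumes the two observations are drawn independently, and the function names (`confidence_set`, `coverage_table`) are introduced here purely for illustration.

```python
from fractions import Fraction
from itertools import product

def confidence_set(x1, x2):
    """The singleton confidence set C(x1, x2) from the example."""
    return {Fraction(x1 + x2, 2)} if x1 != x2 else {x1 - 1}

def coverage_table(theta=0):
    """Enumerate the four equally likely outcomes for an integer theta and
    return (overall, given x1 != x2, given x1 == x2) coverage."""
    outcomes = list(product([theta - 1, theta + 1], repeat=2))
    hits = {True: [], False: []}   # keyed by whether x1 != x2
    for x1, x2 in outcomes:
        hits[x1 != x2].append(theta in confidence_set(x1, x2))
    overall = Fraction(sum(hits[True]) + sum(hits[False]), len(outcomes))
    distinct = Fraction(sum(hits[True]), len(hits[True]))
    equal = Fraction(sum(hits[False]), len(hits[False]))
    return overall, distinct, equal

print(coverage_table())   # overall 3/4, then 1 and 1/2 for the two data subsets
```

Since each subset of data occurs with probability 1/2, the conditional reports average to (1/2)(1) + (1/2)(1/2) = 3/4, the unconditional coverage.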



23.3 Likelihood and stopping rule principles 

Suppose an experiment $E$ is conducted, which consists of observing data $X$
having density $f(x \mid \theta)$, where $\theta$ denotes the unknown parameters of the
statistical model. Let $x_{\mathrm{obs}}$ denote the data actually observed.

Likelihood Principle (LP): The information about $\theta$, arising from just $E$
and $x_{\mathrm{obs}}$, is contained in the observed likelihood function

$$L(\theta) = f(x_{\mathrm{obs}} \mid \theta).$$

Furthermore, if two observed likelihood functions are proportional, then they
contain the same information about $\theta$.

The LP is quite controversial, in that it effectively precludes use of
frequentist measures, which all involve averages of $f(x \mid \theta)$ over $x$ that are
not observed. Bayesians automatically follow the LP because the posterior
distribution of $\theta$ follows from Bayes' theorem (with $p(\theta)$ being the prior
density for $\theta$) as

$$p(\theta \mid x_{\mathrm{obs}}) = \frac{p(\theta) f(x_{\mathrm{obs}} \mid \theta)}{\int p(\theta') f(x_{\mathrm{obs}} \mid \theta') \, d\theta'},$$

which clearly depends on $E$ and $x_{\mathrm{obs}}$ only through the observed likelihood
function. There was not much attention paid to the LP by non-Bayesians, 
however, until the remarkable paper of Birnbaum (1962), which deduced the 
LP as a consequence of the conditionality principle (essentially the Cox exam- 
ple, saying that one should base the inference on the measuring instrument 
actually used) and the sufficiency principle, which states that a sufficient 
statistic for $\theta$ in $E$ contains all information about $\theta$ that is available from the
experiment. At the time of Birnbaum's paper, almost everyone agreed with 
the conditionality principle and the sufficiency principle, so it was a shock 
that the LP was a direct consequence of the two. The paper had a profound 
effect on my own thinking. 

There are numerous clarifications and qualifications relevant to the LP, and 
various generalizations and implications. Many of these (and the history of the 
LP) are summarized in Berger and Wolpert (1984). Without going further,
suffice it to say that the LP is, at a minimum, a very powerful argument for 
conditioning. 

Stopping Rule Principle (SRP): The reasons for stopping experimentation 
have no bearing on the information about $\theta$ arising from $E$ and $x_{\mathrm{obs}}$.

The SRP is actually an immediate consequence of the second part of the 
LP, since "stopping rules" affect $L(\theta)$ only by multiplicative constants. Serious
discussion of the SRP goes back at least to Barnard (1947), who wondered why 
thoughts in the experimenter's head concerning why to stop an experiment 
should affect how we analyze the actual data that were obtained. 

Frequentists typically violate the SRP. In clinical trials, for instance, it is
standard to "spend $\alpha$" for looks at the data; i.e., if there are to be interim
analyses during the trial, with the option of stopping the trial early should
the data look convincing, frequentists view it to be mandatory to adjust the
allowed error probability (down) to account for the multiple analyses.

In Berger and Berry (1988), there is extensive discussion of these issues, 
with earlier references. The complexity of the issue was illustrated by a com- 
ment of Jimmy Savage: 

"I learned the stopping rule principle from Professor Barnard, in con- 
versation in the summer of 1952. Frankly, I then thought it a scan- 
dal that anyone in the profession could advance an idea so patently 
wrong, even as today I can scarcely believe that people resist an idea 
so patently right." (Savage et al., 1962) 






The SRP does not say that one is free to ignore the stopping rule in any
statistical analysis. For instance, common practice in some sciences, when
testing a null hypothesis $H_0$, is to continue collecting data until the p-value
satisfies $p < .05$ and then report the result as if no optional stopping had
been involved. This is obviously bad science in that, even if $H_0$ is true, one
is guaranteed to obtain $p < .05$ if one just collects enough data. This fact
was noted as early as Berkson (1938), who humorously observed that, since
one would be sure to obtain $p < .05$ in this way, we should save everyone
trouble and the cost of experimentation and just declare every hypothesis
rejected with $p < .05$! The correct calculation of a p-value would have to
include the stopping rule used and the problem of "sampling to a foregone
conclusion" would then disappear.
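
A small simulation gives a feel for "sampling to a foregone conclusion." The setup is an assumption made for illustration only: standard Normal data with known variance (so that the null hypothesis of mean zero is true), a two-sided z-test recomputed after every observation, and a cap `max_n` on the sample size. The function names are hypothetical.

```python
import math
import random

def stops_with_rejection(max_n, rng, z_crit=1.96):
    """Sample N(0, 1) data (so H0: mean = 0 is true), testing after every
    observation; return True if |z| ever exceeds z_crit by sample max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if abs(total) / math.sqrt(n) > z_crit:   # naive two-sided p < .05
            return True
    return False

def rejection_rate(max_n, n_sims=500, seed=0):
    """Fraction of simulated studies that reach p < .05 at some look."""
    rng = random.Random(seed)
    return sum(stops_with_rejection(max_n, rng) for _ in range(n_sims)) / n_sims

# The naive rejection rate under H0 grows as more looks are allowed.
for max_n in (1, 10, 100, 1000):
    print(max_n, rejection_rate(max_n))
```

With a single look the rate should be near the nominal .05; allowing more looks pushes it steadily upward, which is the behavior the uncorrected p-value fails to report.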

What the SRP is saying is that methods of statistical inference should
be used which are compatible with the SRP. Bayesian analysis is compatible
with the SRP; it will ignore the stopping rule and will not suffer for doing
so. As but one illustration, in testing $H_0$, a Bayesian would often use a Bayes
factor $B(X)$ (defined later) of $H_0$ to the alternative, which will not depend on
the stopping rule. But Birnbaum (1962) observed that, for any stopping rule,
$\Pr\{B(X) \le \epsilon \mid H_0\} \le \epsilon$, so that optional stopping cannot ensure that a small
Bayes factor (small is evidence against $H_0$) will be obtained. Surprisingly,
there are also frequentist methods that are compatible with the SRP, but
these are inevitably conditional frequentist methods. See Berger et al. (1999)
and Berger et al. (1994) for examples.



23.4 What it means to be a frequentist 

As we move to more complicated examples, it is necessary to define the fre- 
quentist principle of statistics. 

Frequentist Principle: In repeated practical use of a statistical procedure, 
the long-run average actual accuracy should not be less than (and ideally should
equal) the long-run average reported accuracy. 

Suppose, for instance, that a particular statistical model and procedure 
are to be repeatedly used — for instance, a 95% classical confidence interval 
for a Normal mean. This procedure will, in practice, be used on a series of 
different problems involving a series of different Normal means with different 
data. In evaluating the procedure, we should simultaneously be averaging over 
all possible practical instances of utilization of the procedure. 

Textbook statements of the frequentist principle tend to focus on fixing 
the value of, say, the Normal mean, and imagining repeatedly drawing data 
from the given model and utilizing the confidence procedure repeatedly on the 
different data draws. The word imagining is highlighted because this is solely
a thought experiment. What is done in practice is to use the confidence
procedure on a series of different problems, not a series of imaginary repetitions
of the same problem with different data.

Neyman himself often pointed out that the motivation for the frequentist 
principle is in its use on differing real problems; see, e.g., Neyman (1977). Of 
course, the reason textbooks typically give the imaginary repetition of an ex- 
periment version is because of the mathematical fact that if, say, a confidence 
procedure has 95% frequentist coverage for each fixed parameter value, then 
it will necessarily also have 95% coverage when used repeatedly on a series 
of differing problems. And, if the coverage is not constant over each fixed pa- 
rameter value, one can always find the minimum coverage over the parameter 
space, since it will follow that the real frequentist coverage in repeated use 
of the procedure on real problems will never be worse than this minimum 
coverage. 

Pedagogical example continued: Reporting 50% and 100% confidence, as 
appropriate, is fully frequentist, in that the long run reported coverage will 
average .75, which is the long run actual coverage. 

p-values: p-values are not frequentist measures of evidence in any long run
average sense. Suppose we observe $X$, have a null hypothesis $H_0$, and construct
a proper p-value $p(X)$. Viewing the observed $p(x_{\mathrm{obs}})$ as a conditional error rate
when rejecting $H_0$ is not correct from the frequentist perspective. To see this,
note that, under the null hypothesis, a proper p-value will be Uniform on the
interval (0, 1), so that if rejection occurs when $p(X) \le \alpha$, the average reported
p-value under $H_0$ and rejection will be

$$E\{p(X) \mid H_0, \, p(X) \le \alpha\} = \int_0^\alpha p \, \frac{1}{\alpha} \, dp = \frac{\alpha}{2},$$

which is only half the actual long run error $\alpha$. There have been other efforts to
give a real frequentist interpretation of a p-value, none of them successful in
terms of the definition at the beginning of this section. Note that the procedure
{reject $H_0$ when $p(X) \le \alpha$} is a fully correct frequentist procedure, but the
stated error rate in rejection must be $\alpha$, not the p-value.
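
The gap between the average reported p-value and the actual error rate is easy to see by simulation. The sketch below is an illustration only: it draws p-values directly from the Uniform(0, 1) distribution, which is their distribution under the null hypothesis for a proper p-value, and the function name is introduced here.

```python
import random

def mean_p_value_given_rejection(alpha, n_draws=200_000, seed=1):
    """Simulate proper p-values under H0 (Uniform(0, 1)) and return the
    average p-value among rejections and the observed rejection rate."""
    rng = random.Random(seed)
    rejected = [p for p in (rng.random() for _ in range(n_draws)) if p < alpha]
    return sum(rejected) / len(rejected), len(rejected) / n_draws

mean_p, error_rate = mean_p_value_given_rejection(alpha=0.05)
print(mean_p, error_rate)   # roughly alpha / 2 and alpha, respectively
```

Every rejection here is a false rejection, so the long run error rate among null cases is the rejection rate (about .05), while the average number reported would be only about .025 if p-values were quoted as error rates.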

There have certainly been other ways of defining frequentism; see, e.g.,
Mayo (1996) for discussion. However, it is only the version given at the begin- 
ning of the section that strikes me as being compelling. How could one want to 
give statistical inferences that, over the long run, systematically distort their 
associated accuracies? 







23.5 Conditional frequentist inference 

23.5.1 Introduction 

The theory of combining the frequentist principle with conditioning was
formalized by Kiefer (1977), although there were many precursors to the
theory initiated by Fisher and others. There are several versions of the theory, 
but the most useful has been to begin by defining a conditioning statistic S 
which measures the "strength of evidence" in the data. Then one computes 
the desired frequentist measure, but does so conditional on the strength of 
evidence S. 

Pedagogical example continued: $S = |X_1 - X_2|$ is the obvious choice,
$S = 2$ reflecting data with maximal evidential content (corresponding to the
situation of 100% confidence) and $S = 0$ being data of minimal evidential
content. Here coverage probability is the desired frequentist criterion, and an
easy computation shows that conditional coverage given $S$ is given by

$$P_\theta\{C(X_1, X_2) \text{ contains } \theta \mid S = 2\} = 1,$$
$$P_\theta\{C(X_1, X_2) \text{ contains } \theta \mid S = 0\} = 1/2,$$

for the two distinct cases, which are the intuitively correct answers.

23.5.2 Ancillary statistics and invariant models 

An ancillary statistic is a statistic $S$ whose distribution does not depend on
unknown model parameters $\theta$. In the pedagogical example, $S = 0$ and $S = 2$
have probability 1/2 each, independent of $\theta$, and so $S$ is ancillary. When
ancillary statistics exist, they are usually good measures of the strength of
evidence in the data, and hence provide good candidates for conditional
frequentist inference.

The most important situations involving ancillary statistics arise when the 
model has what is called a group-invariance structure; cf. Berger (1985) and 
Eaton (1989). When this structure is present, the best ancillary statistic to use 
is what is called the maximal invariant statistic. Doing conditional frequentist 
inference with the maximal invariant statistic is then equivalent to performing 
Bayesian inference with the right-Haar prior distribution with respect to the 
group action; cf. Berger (1985), Eaton (1989), and Stein (1965). 

Example (Location Distributions): Suppose $X_1, \ldots, X_n$ form a random
sample from the location density $f(x_i - \theta)$. This model is invariant under the
group operation defined by adding any constant to each observation and $\theta$; the
maximal invariant statistic (in general) is $S = (x_2 - x_1, x_3 - x_1, \ldots, x_n - x_1)$,
and performing conditional frequentist inference, conditional on $S$, will give
the same numerical answers as performing Bayesian inference with the
right-Haar prior, here simply given by $p(\theta) = 1$. For instance, the optimal
conditional frequentist estimator of $\theta$ under squared error loss would simply be
the posterior mean with respect to $p(\theta) = 1$, namely

$$\hat{\theta}(x) = \frac{\int \theta \prod_{i=1}^n f(x_i - \theta) \, d\theta}{\int \prod_{i=1}^n f(x_i - \theta) \, d\theta},$$

which is also known as Pitman's estimator.
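
As a rough numerical illustration, the posterior mean under the flat prior can be approximated by a Riemann sum over a grid of parameter values. Everything below is a sketch under stated assumptions: the grid width and size, the example data, and the function names are choices made here, and the two densities (standard Normal and standard Cauchy) are simply convenient location families.

```python
import math

def pitman_estimate(xs, density, half_width=20.0, n_grid=20001):
    """Posterior mean of theta under the flat prior p(theta) = 1,
    approximated by a Riemann sum on an evenly spaced grid that is
    centered at the sample mean."""
    center = sum(xs) / len(xs)
    step = 2 * half_width / (n_grid - 1)
    num = den = 0.0
    for k in range(n_grid):
        theta = center - half_width + k * step
        like = math.prod(density(x - theta) for x in xs)
        num += theta * like
        den += like
    return num / den

def std_normal(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def std_cauchy(z):
    return 1.0 / (math.pi * (1.0 + z * z))

data = [1.2, -0.4, 0.9, 2.1, 0.3]          # hypothetical observations
print(pitman_estimate(data, std_normal))   # agrees with the sample mean
print(pitman_estimate(data, std_cauchy))
```

For the Normal density the estimate reduces to the sample mean; for other location densities it generally differs, but shifting every observation by a constant always shifts the estimate by the same constant.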

Having a model with a group-invariance structure leaves one in an incred- 
ibly powerful situation, and this happens with many of our most common 
statistical problems (mostly from an estimation perspective). The difficulties
of the conditional frequentist perspective are (i) finding the right strength of 
evidence statistic S, and (ii) carrying out the conditional frequentist computa- 
tion. But, if one has a group-invariant model, these difficulties can be bypassed 
because theory says that the optimal conditional frequentist answer is the an- 
swer obtained from the much simpler Bayesian analysis with the right-Haar 
prior. 

Note that the conditional frequentist and Bayesian answers will have dif- 
ferent interpretations. For instance both approaches would produce the same 
95% confidence set, but the conditional frequentist would say that the frequen- 
tist coverage, conditional on S (and also unconditionally), is 95%, while the 
Bayesian would say the set has probability .95 of actually containing $\theta$. Also
note that it is not automatically true that analysis conditional on ancillary
statistics is optimal; see, e.g., Brown (1990).

23.5.3 Conditional frequentist testing 

Upon rejection of $H_0$ in unconditional Neyman-Pearson testing, one
reports the same error probability $\alpha$ regardless of where the test statistic is in
the rejection region. This has been viewed as problematical by many, and is
one of the main reasons for the popularity of p-values. But as we saw earlier,
p-values do not satisfy the frequentist principle, and so are not the conditional
frequentist answer.

A true conditional frequentist solution to the problem was proposed in
Berger et al. (1994), with modification (given below) from Sellke et al. (2001)
and Wolpert (1996). Suppose that we wish to test that the data $X$ arises from
the simple (i.e., completely specified) hypotheses $H_0: f = f_0$ or $H_1: f = f_1$.
The recommended strength of evidence statistic is

$$S = \max\{p_0(x), p_1(x)\},$$

where $p_0(x)$ is the p-value when testing $H_0$ versus $H_1$, and $p_1(x)$ is the
p-value when testing $H_1$ versus $H_0$. It is generally agreed that smaller p-values
correspond to more evidence against a hypothesis, so this use of p-values
in determining the strength of evidence statistic is natural. The frequentist
conditional error probabilities (CEPs) are computed as

$$\begin{aligned} \alpha(s) &= \Pr(\text{Type I error} \mid S = s) = P_0\{\text{reject } H_0 \mid S(X) = s\}, \\ \beta(s) &= \Pr(\text{Type II error} \mid S = s) = P_1\{\text{accept } H_0 \mid S(X) = s\}, \end{aligned} \tag{23.1}$$

where $P_0$ and $P_1$ refer to probability under $H_0$ and $H_1$, respectively.
The corresponding conditional frequentist test is then

$$\begin{array}{l} \text{If } p_0 \le p_1, \text{ reject } H_0 \text{ and report Type I CEP } \alpha(s); \\ \text{If } p_0 > p_1, \text{ accept } H_0 \text{ and report Type II CEP } \beta(s); \end{array} \tag{23.2}$$

where the CEPs are given in (23.1).

These conditional error probabilities are fully frequentist and vary over the 
rejection region as one would expect. In a sense, this procedure can be viewed 
as a way to turn p-values into actual error probabilities. 

It was mentioned in the introduction that, when a good conditional fre- 
quentist procedure has been found, it often turns out to be numerically equiv- 
alent to a Bayesian procedure. That is the case here. Indeed, Berger et al. 
(1994) show that

$$\alpha(s) = \Pr(H_0 \mid x), \qquad \beta(s) = \Pr(H_1 \mid x), \tag{23.3}$$

where $\Pr(H_0 \mid x)$ and $\Pr(H_1 \mid x)$ are the Bayesian posterior probabilities of $H_0$
and $H_1$, respectively, assuming the hypotheses have equal prior probabilities
of 1/2. Therefore, a conditional frequentist can simply compute the objective 
Bayesian posterior probabilities of the hypotheses, and declare that they are 
the conditional frequentist error probabilities; there is no need to formally 
derive the conditioning statistic or perform the conditional frequentist com- 
putations. There are many generalizations of this beyond the simple versus 
simple testing. 
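
To make the recipe concrete, here is a minimal sketch for one hypothetical simple versus simple problem: a single observation from a Normal distribution with unit variance, with the null mean at -1 and the alternative mean at +1. The model, the parameter values, and the function names are all assumptions introduced for illustration; the reported number is just the posterior probability of the hypothesis being rejected, computed with prior probabilities of 1/2, which (23.3) identifies with the conditional error probability.

```python
import math

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def normal_cdf(x, mean):
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2))

def conditional_test(x, mean0=-1.0, mean1=1.0):
    """Hypothetical illustration: one observation x ~ N(mean, 1), testing
    H0: mean = mean0 against H1: mean = mean1, with mean0 < mean1."""
    p0 = 1.0 - normal_cdf(x, mean0)      # p-value for H0 (large x counts against H0)
    p1 = normal_cdf(x, mean1)            # p-value for H1 (small x counts against H1)
    f0, f1 = normal_pdf(x, mean0), normal_pdf(x, mean1)
    post_h0 = f0 / (f0 + f1)             # Pr(H0 | x) with prior probabilities 1/2
    if p0 <= p1:
        return "reject H0", post_h0      # reported as the Type I CEP
    return "accept H0", 1.0 - post_h0    # reported as the Type II CEP

print(conditional_test(2.0))    # rejection with a small reported error
print(conditional_test(0.5))    # rejection with a much larger reported error
```

The reported error probability changes with the observation, unlike a fixed level: data near the boundary between the two hypotheses lead to a report close to 1/2.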

The practical import of switching to conditional frequentist testing (or the
equivalent objective Bayesian testing) is startling. For instance, Sellke et al.
(2001) uses a nonparametric setting to develop the following very general lower
bound on $\alpha(s)$, for a given p-value:

$$\alpha(s) \ge \frac{1}{1 + [-e \, p \ln(p)]^{-1}}. \tag{23.4}$$

Some values of this lower bound for common p-values are given in Table 23.1.
Thus $p = .05$, which many erroneously think implies strong evidence against
$H_0$, actually corresponds to a conditional frequentist error probability at least
as large as .289, which is a rather large error probability. If scientists
understood that a p-value of .05 corresponded to that large a potential error
probability in rejection, the scientific world would be a quite different place.






TABLE 23.1
Values of the lower bound $\alpha(s)$ in (23.4) for various values of $p$.

$p$          .2     .1     .05    .01    .005   .001    .0001   .00001
$\alpha(s)$  .465   .385   .289   .111   .067   .0184   .0025   .00031
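
The bound is simple to evaluate directly. The sketch below assumes the form of the bound given in Sellke et al. (2001), namely the reciprocal of one plus the reciprocal of -e p ln(p), which those authors state for p below 1/e; the function name is introduced here for illustration.

```python
import math

def cep_lower_bound(p):
    """Lower bound (23.4) on the conditional Type I error probability,
    evaluated only for p-values in (0, 1/e)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("p must lie in (0, 1/e)")
    bayes_factor_bound = -math.e * p * math.log(p)
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound)

for p in (0.1, 0.05, 0.01, 0.005, 0.001):
    print(p, round(cep_lower_bound(p), 4))
```

The quantity `bayes_factor_bound` is the corresponding lower bound on the Bayes factor of the null to the alternative; converting it to a probability with equal prior probabilities gives the bound on the conditional error.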



23.5.4 Testing a sequence of hypotheses 

It is common in clinical trials to test multiple endpoints but to do so sequen- 
tially, only considering the next hypothesis if the previous hypothesis was a 
rejection of the null. For instance, the primary endpoint for a drug might be 
weight reduction, with the secondary endpoint being reduction in an aller- 
gic reaction. (Typically, these will be more biologically related endpoints but 
the point here is better made when the endpoints have little to do with each 
other.) Denote the primary endpoint (null hypothesis) by $H_0^1$, and the
statistical analysis must first test this hypothesis. If the hypothesis is not rejected
at level $\alpha$, the analysis stops; i.e., no further hypotheses can be considered.
However, if the hypothesis is rejected, one can go on and consider the
secondary endpoint, defined by null hypothesis $H_0^2$. Suppose this hypothesis is
also rejected at level $\alpha$.

Surprisingly, the overall probability of Type I error (rejecting at least one
true null hypothesis) for this procedure is still just $\alpha$ (see, e.g., Hsu and
Berger, 1999), even though there is the possibility of rejecting two separate
hypotheses. It appears that the second test comes "for free," with rejection 
allowing one to claim two discoveries for the price of one. This actually seems 
too remarkable; how can we be as confident that both rejections are correct 
as we are that just the first rejection is correct? 

If this latter intuition is not clear, note that one does not need to stop
after two hypotheses. If the second is rejected, one can test $H_0^3$ and, if that
is rejected at level $\alpha$, one can go on to test a fourth hypothesis $H_0^4$, etc. Suppose
one follows this procedure and has rejected $H_0^1, \ldots, H_0^{10}$. It is still true that
the probability of Type I error for the procedure (i.e., the probability that
the procedure will result in an erroneous rejection) is just $\alpha$. But it seems
ridiculous to think that there is only probability $\alpha$ that at least one of the 10
rejections is incorrect. (Or imagine a million rejections in a row, if you do not
find the argument for 10 convincing.)

The problem here is in the use of the unconditional Type I error to judge
accuracy. Before starting the sequence of tests, the probability that the
procedure yields at least one incorrect rejection is indeed $\alpha$, but the situation
changes dramatically as we start down the path of rejections. The simplest
way to see this is to view the situation from the Bayesian perspective. Consider
the situation in which all the hypotheses can be viewed as a priori independent
(i.e., knowing that one is true or false does not affect perceptions of the
others). If $x$ is the overall data from the trial, and a total of $m$ tests are
ultimately conducted by the procedure, all claimed to be rejections (i.e., all
claimed to correspond to the $H_0^i$ being false), the Bayesian computes

$$\begin{aligned} \Pr(\text{at least one incorrect rejection} \mid x) &= 1 - \Pr(\text{no incorrect rejections} \mid x) \\ &= 1 - \prod_{i=1}^m \{1 - \Pr(H_0^i \mid x)\}, \end{aligned} \tag{23.5}$$

where $\Pr(H_0^i \mid x)$ is the posterior probability that $H_0^i$ is true given the data.
Clearly, as $m$ grows, (23.5) will go to 1 so that, if there are enough tests,
the Bayesian becomes essentially sure that at least one of the rejections was
wrong. From Section 23.5.3, recall that Bayesian testing can be exactly
equivalent to conditional frequentist testing, so it should be possible to construct a
conditional frequentist variant of (23.5). This will, however, be pursued
elsewhere.
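
A quick calculation shows how fast (23.5) grows. In the sketch below, the posterior probability of .05 for each rejected null hypothesis is a made-up value chosen only for illustration, and the function name is hypothetical.

```python
import math

def prob_any_incorrect_rejection(posterior_null_probs):
    """Evaluate (23.5): one minus the product over the rejected hypotheses
    of {1 - Pr(null is true | x)}, assuming a priori independence."""
    return 1.0 - math.prod(1.0 - q for q in posterior_null_probs)

# Hypothetical case: every rejected null has posterior probability .05.
for m in (1, 2, 10, 100):
    print(m, round(prob_any_incorrect_rejection([0.05] * m), 3))
```

Even with each individual rejection looking fairly safe, the probability that at least one is wrong is already substantial after ten rejections and is close to 1 after a hundred.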

While we assumed that the hypotheses are all a priori independent, it 
is more typical in the multiple endpoint scenario that they will be a priori 
related (e.g., different dosages of a drug). This can be handled within the 
Bayesian approach (and will be explored elsewhere), but it is not clear how 
a frequentist could incorporate this information, since it is information about 
the prior probabilities of hypotheses. 



23.5.5 True to false discovery odds 

A very important paper in the history of genome wide association studies (the 
effort to find which genes are associated with certain diseases) was Burton 
et al. (2007). Consider testing $H_0: \theta = 0$ versus an alternative $H_1: \theta \neq 0$,
with rejection region $\mathcal{R}$ and corresponding Type I and Type II errors $\alpha$ and
$\beta(\theta)$. Let $p(\theta)$ be the prior density of $\theta$ under $H_1$, and define the average power

$$1 - \bar{\beta} = \int \{1 - \beta(\theta)\} \, p(\theta) \, d\theta.$$

Frequentists would typically just pick some value $\theta^*$ at which to evaluate the
power; this is equivalent to choosing $p(\theta)$ to be a point mass at $\theta^*$.

The paper observed that, pre-experimentally, the odds of correctly rejecting
$H_0$ to incorrectly rejecting are

$$O_{\mathrm{pre}} = \frac{\pi_1}{\pi_0} \times \frac{1 - \bar{\beta}}{\alpha}, \tag{23.6}$$

where $\pi_0$ and $\pi_1 = 1 - \pi_0$ are the prior probabilities of $H_0$ and $H_1$. The
corresponding false discovery rate would be $(1 + O_{\mathrm{pre}})^{-1}$.

The paper went on to assess the prior odds $\pi_1/\pi_0$ of a genome/disease
association to be 1/100,000, and estimated the average power of a GWAS
test to be .5. It was decided that a discovery should be reported if $O_{\mathrm{pre}} \ge 10$,
which from (23.6) would require $\alpha \le 5 \times 10^{-7}$; this became the recommended
standard for significance in GWAS studies. Using this standard for a large data 
set, the paper found 21 genome/disease associations, virtually all of which have 
been subsequently verified. 
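
The arithmetic behind the significance standard can be reproduced directly from (23.6). The function names below are introduced here for illustration; the inputs are the prior odds, average power, and target odds quoted in the text.

```python
from fractions import Fraction

def pre_experimental_odds(prior_odds, average_power, alpha):
    """O_pre = (prior odds) * (average power / alpha), as in (23.6)."""
    return prior_odds * average_power / alpha

def alpha_for_target_odds(prior_odds, average_power, target_odds):
    """Largest alpha for which O_pre is at least target_odds."""
    return prior_odds * average_power / target_odds

prior_odds = Fraction(1, 100_000)
power = Fraction(1, 2)
alpha = alpha_for_target_odds(prior_odds, power, 10)
odds = pre_experimental_odds(prior_odds, power, alpha)
print(float(alpha))            # 5e-07
print(odds)                    # 10
print(float(1 / (1 + odds)))   # the corresponding false discovery rate, 1/11
```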

An alternative approach that was discussed in the paper is to use the
posterior odds rather than pre-experimental odds, i.e., to condition. The
posterior odds are

$$O_{\mathrm{post}}(x) = \frac{\pi_1}{\pi_0} \times \frac{m(x \mid H_1)}{f(x \mid 0)}, \tag{23.7}$$

where $m(x \mid H_1) = \int f(x \mid \theta) \, p(\theta) \, d\theta$ is the marginal likelihood of the data $x$
under $H_1$. (Again, this prior could be a point mass at $\theta^*$ in a frequentist
setting.) It was noted in the paper that the posterior odds for the 21 claimed
associations ranged between 1/10 (i.e., evidence against the association being 
true) to 10^^ (overwhelming evidence in favor of the association). It would 
seem that these conditional odds, based on the actual data, are much more 
scientifically informative than the fixed pre-experimental odds of 10/1 for the 
chosen a, but the paper did not ultimately recommend their use because it 
was felt that a frequentist justification was needed. 

Actually, use of $O_{\mathrm{post}}$ is as fully frequentist as is use of $O_{\mathrm{pre}}$, since it
is trivial to show that $E\{O_{\mathrm{post}}(X) \mid H_0, \mathcal{R}\} = O_{\mathrm{pre}}$, i.e., the average of the
conditional reported odds equals the actual pre-experimental reported odds,
which is all that is needed to be fully frequentist. So one can have the much
more scientifically useful conditional report, while maintaining full frequentist
justification. This is yet another case where, upon getting the conditioning
right, a frequentist completely agrees with a Bayesian.
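
The identity between the average posterior odds (under the null hypothesis, given rejection) and the pre-experimental odds can be checked numerically in a simple case. The setup below is hypothetical and chosen only for illustration: a single Normal observation with unit variance, a null value of 0, a point-mass alternative at `theta_star`, and a one-sided rejection region above the cutoff `c`. The conditional expectation is approximated by a midpoint rule.

```python
import math

def normal_pdf(x, mean=0.0):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def upper_tail(c, mean=0.0):
    """P(X > c) for X ~ N(mean, 1)."""
    return 0.5 * math.erfc((c - mean) / math.sqrt(2))

def check_average_odds(theta_star=3.0, c=2.5, prior_odds=0.01,
                       n_grid=50_000, upper=15.0):
    """Hypothetical setup: X ~ N(theta, 1), H0: theta = 0, point-mass
    alternative at theta_star, rejection region {x > c}.
    Returns (O_pre, numerical average of O_post under H0 given rejection)."""
    alpha = upper_tail(c)                   # Type I error
    power = upper_tail(c, theta_star)       # power at theta_star
    o_pre = prior_odds * power / alpha
    step = (upper - c) / n_grid
    total = 0.0
    for k in range(n_grid):
        x = c + (k + 0.5) * step            # midpoint of the k-th cell
        o_post = prior_odds * normal_pdf(x, theta_star) / normal_pdf(x)
        total += o_post * normal_pdf(x) / alpha * step
    return o_pre, total

o_pre, average_o_post = check_average_odds()
print(o_pre, average_o_post)   # the two values agree up to numerical error
```

The agreement is no accident: the null density in the denominator of the posterior odds cancels against the null density used for averaging, leaving the integral of the marginal likelihood over the rejection region, which is the average power.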



23.6 Final comments 

Lots of bad science is being done because of a lack of recognition of the 
importance of conditioning in statistics. Overwhelmingly at the top of the list 
is the use of p-values and acting as if they are actually error probabilities. 
The common approach to testing a sequence of hypotheses is a new addition 
to the list of bad science because of a lack of conditioning. The use of pre- 
experimental odds rather than posterior odds in GWAS studies is not so much 
bad science, as a failure to recognize a conditional frequentist opportunity 
that is available to improve science. Violation of the stopping rule principle 
in sequential (or interim) analysis is in a funny position. While it is generally 
suboptimal (for instance, one could do conditional frequentist testing instead), 
it may be necessary if one is committed to certain inferential procedures such 
as fixed Type I error probabilities. (In other words, one mistake may require 
the incorporation of another mistake.) 






How does a frequentist know when a serious conditioning mistake is being 
made? We have seen a number of situations where it is clear but, in general,
there is only one way to identify if conditioning is an issue — Bayesian analysis. 
If one can find a Bayesian analysis for a reasonable prior that yields the same 
answer as the frequentist analysis, then there is probably not a conditioning 
issue; otherwise, the conflicting answers are probably due to the need for 
conditioning on the frequentist side. 

The most problematic situations (and unfortunately there are many) are 
those for which there exists an apparently sensible unconditional frequentist 
analysis but Bayesian analysis is unavailable or too difficult to implement given 
available resources. There is then not much choice but to use the unconditional 
frequentist analysis, but one might be doing something silly because of not 
being able to condition and one will not know. The situation is somewhat 
comparable to seeing the report of a Bayesian analysis but not having access 
to the prior distribution. 

While I have enjoyed reminiscing about conditioning, I remain as perplexed 
today as 35 years ago when I first learned about the issue; why do we still not
treat conditioning as one of the most central issues in statistics? 



References 

Barnard, G.A. (1947). A review of 'Sequential Analysis' by Abraham Wald. 
Journal of the American Statistical Association, 42:658-669.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. 
Springer, New York. 

Berger, J.O. and Berry, D.A. (1988). The relevance of stopping rules in sta- 
tistical inference. In Statistical Decision Theory and Related Topics IV, 
1. Springer, New York, pp. 29-47. 

Berger, J.O., Boukai, B., and Wang, Y. (1999). Simultaneous Bayesian- 
frequentist sequential testing of nested hypotheses. Biometrika, 86:79-92. 

Berger, J.O., Brown, L.D., and Wolpert, R.L. (1994). A unified conditional 
frequentist and Bayesian test for fixed and sequential simple hypothesis 
testing. The Annals of Statistics, 22:1787-1807. 

Berger, J.O. and Wolpert, R.L. (1984). The Likelihood Principle. IMS Lec- 
ture Notes, Monograph Series, 6. Institute of Mathematical Statistics, 
Hayward, CA.

Berkson, J. (1938). Some difficulties of interpretation encountered in the 
application of the chi-square test. Journal of the American Statistical 
Association, 33:526-536. 






Birnbaum, A. (1962). On the foundations of statistical inference. Journal of 
the American Statistical Association, 57:269-306. 

Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear 
regression. The Annals of Statistics, 18:471-493. 

Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N., Deloukas, P., 
Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W.H., 
Samani, N.J., et al. (2007). Genome-wide association study of 14,000 cases
of seven common diseases and 3,000 shared controls. Nature, 447:661-678. 

Cox, D.R. (1958). Some problems connected with statistical inference. The 
Annals of Mathematical Statistics, 29:357-372. 

Eaton, M.L. (1989). Group Invariance Applications in Statistics. Institute of 
Mathematical Statistics, Hayward, CA. 

Hsu, J.C. and Berger, R.L. (1999). Stepwise confidence intervals without
multiplicity adjustment for dose-response and toxicity studies. Journal of 
the American Statistical Association, 94:468-482. 

Kiefer, J. (1977). Conditional confidence statements and confidence estima- 
tors. Journal of the American Statistical Association, 72:789-808. 

Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge. Uni- 
versity of Chicago Press, Chicago, IL. 

Neyman, J. (1977). Frequentist probability and frequentist statistics.
Synthese, 36:97-131.

Savage, L.J., Barnard, G., Cornfield, J., Bross, I., Box, G.E.P., Good, I.J.,
Lindley, D.V., Clunies-Ross, C.W., Pratt, J.W., Levene, H. et al. (1962). 
On the foundations of statistical inference: Discussion. Journal of the 
American Statistical Association, 57:307-326. 

Sellke, T., Bayarri, M., and Berger, J.O. (2001). Calibration of p-values for 
testing precise null hypotheses. The American Statistician, 55:62-71. 

Stein, C. (1965). Approximation of improper prior measures by prior prob- 
ability measures. In Bernoulli-Bayes-Laplace Festschrift. Springer, New 
York, pp. 217-240. 

Wolpert, R.L. (1996). Testing simple hypotheses. In Studies in Classification, 
Data Analysis, and Knowledge Organization, Vol. 7. Springer, New York, 
pp. 289-297. 



